Small vs. Large Software-Development Projects



Small Projects (<$1M) vs. Large Projects (>$10M)

Chart legend:

• on time, on budget
• challenged (late, over budget, insufficient functionality)
• failed (cancelled prior to completion, or delivered and never used)



Do Big Project via Service-Oriented Architecture + Many Small Services! 

CHAOS MANIFESTO 2013: Think Big, Act Small, www.standishgroup.com.

Based on the collection of project case information on 50,000 real-life IT 
environments and SW projects. Surveying since 1985. 
CEO: Amazon shall use SOA!

(2002, 7 years after the company started)

1. “All teams will henceforth expose their data and
functionality through service interfaces.”

2. “Teams must communicate with each other
through these interfaces.”

3. “There will be no other form of interprocess
communication allowed: no direct linking, no direct
reads of another team's data store, no shared-memory
model, no back-doors whatsoever.”

4. “Service interfaces must be designed from the
ground up to be externalizable. That is to say, the
team must plan and design to be able to expose the
interface to developers in the outside world.”

5. “Anyone who doesn't do this will be fired.”
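
As a rough illustration of the mandate above (not Amazon's actual code; every class and function name here is hypothetical), a team keeps its data store private and other teams reach it only through a service interface:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


class OrderService(Protocol):
    """Hypothetical interface: the only thing other teams may depend on."""

    def get_order(self, order_id: str) -> Order: ...


class InMemoryOrderService:
    """One team's implementation; the data store stays private to it."""

    def __init__(self) -> None:
        self._store = {}  # order_id -> Order, never read directly by other teams

    def put_order(self, order: Order) -> None:
        self._store[order.order_id] = order

    def get_order(self, order_id: str) -> Order:
        return self._store[order_id]


def invoice_line(orders: OrderService, order_id: str) -> str:
    """A consumer in another team: it sees the interface, not the store."""
    order = orders.get_order(order_id)
    return f"{order.order_id}: ${order.total_cents / 100:.2f}"
```

Because `invoice_line` depends only on `OrderService`, the owning team could swap the in-memory store for a database or a remote call without the consumer changing.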
Internet APIs + Uses



WASAPI 

Credit: Alexis Rossi 
The Big Picture


• Ingest: 

o Sharing Crawlers, Capturing renderings, Deduplication 
o Divide/Conquer crawling, Soft Errors, Metadata 
extraction 

o Crawl management 

• Preservation: 

o Detect/repair damage, advertise holdings 

• Dissemination: 

o Memento, federated browsing, text & metadata search 
o Bulk access, format migration, data mining 
o Emulation 
Landscapes & WASAPI




Jefferson Bailey, Internet Archive (@jefferson_baii) 






Growth in Web Archiving (NDSA & Archive-It) 
FIGURE 4: YEAR INSTITUTIONS BEGAN ARCHIVING WEB CONTENT

Local Preservation of Web Archives


Recent Surveys of local preservation of 
web data 

• NDSA: 18%-20% (2011, 2013, 2016)

• AIT: 20% of respondents (2016) 

• Reasons include 

- No local preservation plan 

- Trust in service 

- Doesn't integrate with existing 
workflows 

- Too much data 


FIGURE 15: REASONS FOR NOT TRANSFERRING DATA FROM AN EXTERNAL SERVICE

(Bar chart comparing 2011 and 2013 responses; visible categories: building our in-house infrastructure, no place to store, not sure what we would do with it.)
Community Involvement in WA Development


• Few coordinated efforts on shared tools 


(Chart: contributions to master, excluding merge commits, May 10, 2009 - Apr 2, 2016)


• Historical reliance on few providers 

• Variance of coordination on emergent 
efforts & foresight on interoperability 

• Few on-ramps for non-dev participation

• Yet some collaborative digital library 
efforts have proven successful 

• Emergence of broader web archiving 
community of practice 
Other Challenges


• Web Archiving often still a niche 
collecting activity 

• Use largely TBD or not measured 

• Convenience of end-to-end 
services diminishes tech needs 

• Little familiarity with formats, 
software, or processes 

• Nascent community impetus to 
join or advise on broad technical 
development activities 
• Wayback APIs

• Archive-It Partner 
Metadata APIs 

• Data Analytics APIs 
(crawl logs and reports) 

• Index (CDX) APIs 

• Upload APIs (non-web) 

• Internal APIs 

https://github.com/ArchiveLabs/api.archive.org 
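
As one concrete example of the Index (CDX) APIs listed above, the sketch below builds a query URL for the public Wayback CDX server. The endpoint and the `url`, `output`, `limit`, and `from` parameters follow that server's public documentation as best recalled here, so treat them as assumptions to check against the current docs; the helper name `cdx_query_url` is made up, and no request is actually sent.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed public endpoint of the Wayback CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url: str, limit: int = 5, year_from: Optional[str] = None) -> str:
    """Build (but do not send) a CDX index query for captures of `url`."""
    params = {"url": url, "output": "json", "limit": str(limit)}
    if year_from is not None:
        params["from"] = year_from  # lower bound on the capture timestamp
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(cdx_query_url("example.com", limit=2, year_from="2016"))
```

The resulting URL can be fetched with any HTTP client; each returned row describes one capture in the index.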





WASAPI: Web Archiving Systems APIs 


• "Systems Interoperability and Collaborative 
Development for Web Archives" 

• National Leadership Grant, National
Digital Platform, R&D

• IA/AIT (PI), Stanford, UNT, Rutgers

• 2-year project started January 2016

• National Symposium Early 2017
WASAPI: Web Archiving Systems APIs


Three Key Areas of R&D: 

1) What are the attributes of a community model that can
support sustainable and broad-based collaborative web 
archiving technology development? 

2) What are the community needs and downstream uses 
for the planned Export APIs (by AIT & LOCKSS) to 
facilitate transfer of web archive data between 
distributed systems and what other prospective APIs 
does it point to? 

3) How can better interoperability of web archiving 
systems support new forms of access and research use? 
WASAPI: Web Archiving Systems APIs


Outcomes: 

1) Seed & launch a community modeled on the 
characteristics of successful development and 
participation communities ID'ed by project 

2) Build WARC & derivative dataset APIs (AIT & LOCKSS) 
and test via transfer to partners (SUL, UNT, Rutgers) 
to enable better distributed preservation and access 

3) Sketch a blueprint and technical model for future 
web archiving APIs informed by R&D 

4) Seed a technical infrastructure that will facilitate 
more computational and distributed research use of 
web archive collections 
WASAPI Technical Working Group
and Current Progress

related API work


• CDX Server API (IA, IIPC) 

• derivative formats (Archive-It, BL)

• crawl logs/partner data (Archive-It)

• Wayback Machine APIs (IA) 

• proliferating capture tools (GWU, IA, Rhizome) 

• Cobweb (CDL, Harvard, UCLA) 



use cases 


• Archive-It ->
  - partner IR/local use
  - DPN
  - LOCKSS (PLN)

• CDL -> Archive-It (migration)

• DLSS -> IA (WebBase)

• [EoT partners] <-> [EoT partners]

• IA global Wayback ->
  - LOCKSS (OA content)
  - national libraries

• LOCKSS (.gov) -> IA

• [any web archive] ->
  - researcher
  - original publisher
data exchange b/t repositories 



local repository 


preservation network 


standardizing researcher data access 




service provider 



local repository 
data exchange within repositories



candidate features discussed 


content negotiation for W/ARC or derivatives 

protocol negotiation for transfer handoff 
ability to specify parameters for custom export 
metadata for provenance, crawler 
configuration, crawl logs, description 
request custom data extraction 
authentication + privileges management 


export API example 

• authentication
  - (system tracks permissions)

• submit institution ID
  - return associated collection IDs

• submit collection ID(s)
  - return associated job IDs

• submit job ID(s)
  - return associated W/ARC files

• submit candidate W/ARC files
  - return supported protocols

• initiate transfer
  - (transfer files)
  - (acknowledge transfer completion status)
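
The export steps above can be sketched against an in-memory stand-in. This is not the WASAPI API: the class, method names, tokens, IDs, file names, and protocol list are all invented for illustration, and no files are actually moved.

```python
from dataclasses import dataclass, field


@dataclass
class MockExportService:
    """In-memory stand-in for a hypothetical export API; all data is invented."""

    tokens: dict = field(default_factory=lambda: {"secret-token": "inst-1"})
    collections: dict = field(default_factory=lambda: {"inst-1": ["coll-10"]})
    jobs: dict = field(default_factory=lambda: {"coll-10": ["job-100", "job-101"]})
    files: dict = field(default_factory=lambda: {
        "job-100": ["a.warc.gz"],
        "job-101": ["b.warc.gz", "c.warc.gz"],
    })

    def authenticate(self, token):
        # The system tracks permissions: a token maps to one institution.
        if token not in self.tokens:
            raise PermissionError("unknown token")
        return self.tokens[token]

    def collection_ids(self, institution_id):
        return list(self.collections.get(institution_id, []))

    def job_ids(self, collection_ids):
        return [j for c in collection_ids for j in self.jobs.get(c, [])]

    def warc_files(self, job_ids):
        return [f for j in job_ids for f in self.files.get(j, [])]

    def supported_protocols(self, candidate_files):
        # Assumption: every file is offered over the same protocols.
        return ["https", "rsync"] if candidate_files else []

    def transfer(self, candidate_files, protocol):
        if protocol not in self.supported_protocols(candidate_files):
            raise ValueError(f"unsupported protocol: {protocol}")
        # A real service would move bytes here; the mock only acknowledges.
        return {name: "complete" for name in candidate_files}


def export_all(service, token, protocol="https"):
    """Walk institution -> collections -> jobs -> files, then transfer."""
    institution = service.authenticate(token)
    collections = service.collection_ids(institution)
    jobs = service.job_ids(collections)
    files = service.warc_files(jobs)
    return service.transfer(files, protocol)
```

Calling `export_all(MockExportService(), "secret-token")` returns a per-file completion status, mirroring the final acknowledgement step in the list.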



THANKS! (discussion is next) 

WASAPI 


https://groups.google.com/forum/#!forum/wasapi-community


https://github.com/WASAPI-Community 

https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf
Discussion Questions

What APIs have attendees built, or are currently using, 
in their web archiving activities? 

Are these APIs RESTful? If not, why not? 

What frameworks/languages were they built with? What 
are other notable characteristics of their development 
and maintenance? 

What part of the web archiving lifecycle would most 
benefit from next-stage API development, post-WASAPI? 



